Vi-En MT Experiment

To evaluate a machine translation (Vietnamese to English) model trained from scratch
Created on January 14 | Last edited on February 24


1. Setup:

Training on Google Colab (1 P100 GPU with 16GB memory)
(*Update 1: Now using 1 RTX-3090 with 24GB memory rented on vast.ai)

2. Data:

2.1. Download data:

This includes my dataset (train/valid/test), crawled from different sources, plus extra test sets in the i2r_test_data folder

2.2. Process data:

Lowercase all the texts
Apply the Moses tokenizer (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) to both the Vietnamese and English sides of the parallel corpus (*Note: Moses does not support vi, so the en rules were applied to both sides, which should be acceptable)
Apply BPE (https://github.com/rsennrich/subword-nmt.git) with a vocab size of 32000
The binarized processed data used for training can be found here: 
(*Update: including processed and processed_joined_dictionary)
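For concreteness, here is a minimal sketch of this pipeline, assuming a mosesdecoder checkout and the pip-installed subword-nmt CLI; the file names, the jointly learned BPE codes, and the worker count are my assumptions, not details from the report:
# Lowercase and tokenize both sides with the Moses scripts (en rules for both,
# as noted above); file names are placeholders.
MOSES=mosesdecoder/scripts/tokenizer
for lang in vi en; do
  perl $MOSES/lowercase.perl < train.$lang \
    | perl $MOSES/tokenizer.perl -l en > train.tok.$lang
done

# Learn BPE codes with a 32k vocabulary and apply them to each side
# (shown here learned jointly over both languages; the report does not say)
cat train.tok.vi train.tok.en | subword-nmt learn-bpe -s 32000 > bpe.codes
for lang in vi en; do
  subword-nmt apply-bpe -c bpe.codes < train.tok.$lang > train.bpe.$lang
done

# Binarize for fairseq (valid/test are assumed to be prepared the same way)
fairseq-preprocess --source-lang vi --target-lang en \
  --trainpref train.bpe --validpref valid.bpe --testpref test.bpe \
  --destdir data-bin/processed --workers 4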

3. Training:

Using the fairseq library (https://github.com/pytorch/fairseq)
Configuration details for each experiment can be checked in the respective wandb tab. The graph below shows the evaluation on the validation dataset during training:

[W&B panel: BLEU on the validation set during training, across the 18 runs]


3.1. Experiment 1

(*Note: includes both 1a and 1b)
Model: transformer_wmt_en_de
Default settings
!CUDA_VISIBLE_DEVICES=0 fairseq-train \
$DATA_DIR \
--task translation \
--arch transformer_wmt_en_de --share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.3 --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 4096 \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
--no-progress-bar --log-format simple --keep-interval-updates 20 \
--save-dir $CHECKPOINT_DIR \
--update-freq 8 --max-epoch 5 \
--wandb-project "Vi to En Translation"
Model checkpoints and predictions can be found here: https://drive.google.com/drive/folders/1dAZYj_Vk0vY1FV7U6B0vIBC_hUENDXRS?usp=sharing

3.2. Experiment 2

Model: fconv_wmt_en_de
!CUDA_VISIBLE_DEVICES=0 fairseq-train \
$DATA_DIR \
--task translation \
--arch fconv_wmt_en_de \
--dropout 0.2 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer nag --clip-norm 0.1 \
--lr 0.5 --lr-scheduler fixed --force-anneal 50 \
--max-tokens 4096 \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
--no-progress-bar --log-format simple --keep-interval-updates 20 \
--save-dir $CHECKPOINT_DIR \
--update-freq 8 --max-epoch 5 \
--wandb-project "Vi to En Translation"
Model checkpoints and predictions can be found here: https://drive.google.com/drive/folders/1SOgDxPJIoWdBErmQKL2KyoZ7EixZJjgQ?usp=sharing

3.3. Experiment 3

Same as "Experiment 1", but used a joined-dictionary during data preprocessing instead.
Model checkpoints and predictions can be found here: https://drive.google.com/drive/folders/11PzqB6IMKxZFcmD8A0Krnhv6cCOEVj89?usp=sharing

3.4. Experiment 4

Same as "Experiment 3", but used mixed precision training fp-16, so I could increase max token size from 4096 to 10000. Also increase lr from 5e-4 to 1e-3 and increase number of epochs from 5 to 12.
Model checkpoints and predictions can be found here: https://drive.google.com/drive/folders/1SMpwdyguPYRKptnkfAbYXmU9ulbCAOl-?usp=sharing

3.5. Experiment 5

Same as "Experiment 3", but used transformer_wmt_en_de_big
Model checkpoints and predictions can be found here: https://drive.google.com/drive/folders/1fYU2uvNWQoSpRNH2SUvDC_HS9ov4Inbb?usp=sharing

3.6. Experiment 6

(*Note: includes both 6a and 6b)
Same as "Experiment 1", but from here onwards training runs on the rented RTX 3090 GPU.
Model checkpoints and predictions can be found here: https://drive.google.com/drive/folders/1ao4duvdUC6Q3KkHE79O8-SysfRLKwyd8?usp=sharing

3.7. Experiment 7

Same as "Experiment 4", but no longer using the joined dictionary.
Using a joined dictionary does not seem to affect the results.
Model checkpoints and predictions can be found here: https://drive.google.com/drive/folders/1mrqTqfGozHn4DCkrwkkpoyIDxlEK7_Dj?usp=sharing

3.8. Experiment 8

Model: bart_base
!CUDA_VISIBLE_DEVICES=0 fairseq-train \
$DATA_DIR \
--task translation \
--arch bart_base --layernorm-embedding --share-all-embeddings --share-decoder-input-output-embed \
--optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-08 --clip-norm 0.1 \
--lr-scheduler polynomial_decay --lr 3e-04 --warmup-updates 2500 \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 10000 \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
--no-progress-bar --log-format simple --keep-interval-updates 20 \
--save-dir $CHECKPOINT_DIR \
--update-freq 8 --max-epoch 12 \
--fp16 \
--wandb-project "Vi to En Translation"
Model checkpoints and predictions can be found here: https://drive.google.com/drive/folders/10eqZqQ6l00_sUTNB5-ufYrLyEU1u4FhH?usp=sharing

3.9. Experiment 9

Continue experiment_8 to reach a total of 24 epochs.
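fairseq-train resumes from checkpoint_last.pt in --save-dir by default (the --restore-file default), so continuing a run is just rerunning the same command with a larger --max-epoch. A sketch, assuming a hypothetical train_exp8.sh that wraps the Experiment 8 command with "$@" appended (argparse keeps the last value of a repeated flag):
# Same flags as Experiment 8, except the epoch budget
!bash train_exp8.sh --max-epoch 24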
Model checkpoints and predictions can be found here: https://drive.google.com/drive/folders/19d3tPluI0VhJAz-xCKhv7OJb9o3XJ2He?usp=sharing

3.10. Experiment 10

Continue experiment_9 to reach a total of 30 epochs.
Added more training data (>2 million sentence pairs) from this paper: https://github.com/vietai/SAT, saved as train1 in the processed_joined_dictionary folder
Model checkpoints and predictions can be found here: https://drive.google.com/drive/folders/1qZpmUomTQvHm_iBxeic2GClLM5mq0YEL?usp=sharing

3.11. Experiment 11

(*Note: includes both 11a and 11b)
Continue experiment_10 to reach a total of 40 epochs.
Model checkpoints and predictions can be found here: https://drive.google.com/drive/folders/1U-MqIfnQLWIq-TdDJWgHewRDoFSwQaNS?usp=sharing

3.13. Experiment 13

(*Update 1: changed from v1 to v2 of the additional dataset from vietai/SAT)
Train an en2vi translation model (experiment_12) for 25 epochs; model checkpoints can be found here: https://drive.google.com/drive/folders/1zGzz4JNmV7CLW5f9J50PHzL3MPrz3BDb?usp=sharing
Use the en2vi model to translate monolingual English news text (~3 million sentences) into Vietnamese (original source: https://wortschatz.uni-leipzig.de/en/download/English). The English dataset after cleaning can be found here: https://drive.google.com/file/d/1-2xP8md5ILacc9Fy4S02s8e7DQ2dT15G/view?usp=sharing
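A sketch of this back-translation step, assuming mono.en holds the cleaned English text and that the $EN2VI_DATA_DIR, $EN2VI_CHECKPOINT, and $BPE_FILE variables (hypothetical names) point to the experiment_12 data directory, checkpoint, and BPE codes; fairseq-interactive reads raw text from stdin and emits hypotheses on H-* lines:
!cat mono.en \
| fairseq-interactive $EN2VI_DATA_DIR \
--path $EN2VI_CHECKPOINT \
--source-lang en --target-lang vi \
--tokenizer moses --bpe subword_nmt --bpe-codes $BPE_FILE \
--remove-bpe --beam 5 \
--buffer-size 1024 --max-tokens 8000 \
| grep '^H-' | cut -f3 > mono.vi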
Add these additional synthetic pairs to the final training set and train a vi2en model, same as experiment_8, for a total of 25 epochs.
Model checkpoints and predictions (vi2en) can be found here: https://drive.google.com/drive/folders/1sOKqKOvTufTZKg4-bvhDul_Bgv5S6e7b?usp=sharing
(*Update 2: for evaluation, use an average of the last checkpoints instead of checkpoint_best)
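Checkpoint averaging can be done with the average_checkpoints.py script that ships in the fairseq repository; a sketch, where the number of averaged checkpoints is illustrative, not taken from the report:
!python fairseq/scripts/average_checkpoints.py \
--inputs $CHECKPOINT_DIR \
--num-epoch-checkpoints 5 \
--output $CHECKPOINT_DIR/checkpoint_avg.pt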

4. Evaluation:

Using the sacrebleu library (https://github.com/mjpost/sacrebleu) on the detokenized, lowercased texts. Beam size defaults to 5. All predictions are made with checkpoint_best.pt, which has the highest BLEU score on the validation set. The .lc.detok reference files are used.
%%bash
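# Translate the binarized test set with the best checkpoint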
fairseq-generate --fp16 \
$BINARIZED_TEST_FILE \
--path $CHECKPOINT_FILE \
--max-tokens 4096 --beam 5 \
--source-lang vi --target-lang en --moses-no-dash-splits --tokenizer moses \
--bpe subword_nmt --bpe-codes $BPE_FILE \
--remove-bpe \
| tee gen.out

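# Keep only the detokenized hypotheses (the D-* lines in fairseq-generate output)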
grep ^D gen.out | cut -f3- > $TRANSLATION_FILE

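# Score the hypotheses; --lowercase matches the lowercased training setup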
sacrebleu \
$REFERENCE_FILE \
--input $TRANSLATION_FILE \
--metrics {bleu,chrf,ter} \
--lowercase --chrf-lowercase \
--confidence > $SCORE_FILE
Predictions from a baseline using Google Translate can also be found here: https://drive.google.com/drive/folders/14XvthBvpEvbRi4NTzIySfyD93unufbCI?usp=sharing
More detailed scores can be found in the respective Google Drive folders. A summary table is shown below:
arch / BLEU4        test    test.i2r    test.own
google-translate    38.9    49.2        43.3
experiment_1        31.7    34.6        37.8
experiment_2        30.2    33.8        37.1
experiment_3        33.7    36.6        39.8
experiment_4        34.6    35.9        40.5
experiment_5        31.6    34.9        37.7
experiment_6        31.9    35.3        37.8
experiment_7        34.0    36.6        40.3
experiment_8        38.8    36.5        42.4
experiment_9        43.5    37.3        43.1
experiment_10       44.2    41.3        45.5
experiment_11       47.2    40.7        47.3
experiment_13       43.8    43.1        47.7
experiment_14       36.7    42.0

(*Note: test.own is my own test dataset, while the other two are the given external test datasets.)
(*Update 1: switched from the GOOGLETRANSLATE function in Google Sheets to the Google Translate website; BLEU4 increased)
(*Update 2: added the --moses-no-dash-splits flag)

5. Side experiments:

5.1. Beam size:

Use experiment_8 checkpoints
beam / BLEU4    test    test.i2r    test.own
5               37.3    35.6        42.4
10              37.6    35.9        42.5
20              37.7    36.1        42.7
50              37.8    36.0
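These scores come from rerunning the generation command from section 4 with a different --beam value; e.g., for beam 20:
!fairseq-generate --fp16 \
$BINARIZED_TEST_FILE \
--path $CHECKPOINT_FILE \
--max-tokens 4096 --beam 20 \
--source-lang vi --target-lang en --moses-no-dash-splits --tokenizer moses \
--bpe subword_nmt --bpe-codes $BPE_FILE \
--remove-bpe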


5.2. Reference files:

Use experiment_11 and experiment_10 checkpoints plus the google-translate baseline, scored against the .lc.detok test.i2r reference files
ref / BLEU4    experiment_11    google-translate    experiment_10
0              32.7             39.6                33.2
1              28.4             34.1                29.0
2              28.4             33.2                28.5
3              26.2             31.4                26.6
0&1            36.1             44.0                36.7
2&3            32.8             39.5                33.6
0&1&2&3        40.1             48.6                40.9
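The combined rows ("0&1", "0&1&2&3") use sacrebleu's multi-reference scoring, which takes several reference files as positional arguments; a sketch, with placeholder variables standing in for the .lc.detok reference files:
!sacrebleu $REF_FILE_0 $REF_FILE_1 \
--input $TRANSLATION_FILE \
--lowercase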


5.3. Manual inspections:

Compare experiment_13 and google-translate outputs on the test.i2r test set. src is the original Vietnamese sentence taken from test.i2r.vi, ref is taken from test.i2r.en.0, gtt is from Google Translate, and own is the translation from experiment_13's best checkpoint. TP is the model's sentence-level translation (log-)probability, D-TP its average under Monte Carlo dropout, and D-Lex-Sim the lexical similarity between dropout-sampled translations (higher values indicate higher model confidence).

idx: 524
src: Mức lương thứ 3 là phổ biến nhất, đang được áp dụng cho các ngân hàng nhỏ hơn, khoảng 10.000 đô la một tháng.
ref: The third, the most popular salary level, is being applied to smaller banks, about $10,000 a month.
gtt: The 3rd salary is the most common, being applied to smaller banks, around $10,000 a month.
own: the 3rd salary is the most common, which is being applied to smaller banks, at about $10,000 a month.
Both google translate and my model gave similar translations which are valid, but "3rd salary level" would be a more correct translation.
TP = -0.446
D-TP = -0.455
D-Lex-Sim = 1.000

idx: 815
src: Erasmus Mundus là chương trình học bổng lớn của Liên hiệp châu Âu giúp sinh viên xuất sắc có cơ hội được học tập tại những trường đại học của châu Âu.
ref: Erasmus Mundus is a big scholarship program of the Europe Union which helps outstanding students have the opportunity to study at the European universities.
gtt: Erasmus Mundus is a major European Union scholarship program that helps excellent students have the opportunity to study at European universities.
own: erasmus mundus is the european union's major scholarship program that gives outstanding students the opportunity to study at the universities of europe.
Both google translate and my model gave similar, valid translations.
TP = -0.608
D-TP = -0.609
D-Lex-Sim = 0.675

idx: 1279
src: Yoon Eun Hye vừa trở thành gương mặt trang bìa cho tạp chí High Cut với hình ảnh người phụ nữ hiện đại, thành thị.
ref: Yoon Eun Hye graced High Cut magazine not long ago with the image of a modern and urban lady.
gtt: Yoon Eun Hye has just become the cover face for High Cut magazine with the image of a modern, urban woman.
own: yoon eun hye just became the cover face of high cut magazine with the image of a modern, urban woman.
Both google translate and my model gave similar, valid translations.
TP = -0.539
D-TP = -0.553
D-Lex-Sim = 0.640

idx: 1022
src: George Skinner, bố của Alice, cho biết: "Có nhiều lúc chúng tôi nghĩ rằng con gái mình sẽ không qua khỏi".
ref: George Skinner, Alice's father, said: "We have thought many times that our daughter was't able to last out".
gtt: George Skinner, Alice's father, said: "There were times when we thought our daughter was not going to make it."
own: "there were times when we thought our daughter was not going to make it," said george skinner, alice's father.
Both google translate and my model gave similar, valid translations, but my model rearranged the clause order.
TP = -0.309
D-TP = -0.316
D-Lex-Sim = 0.786

idx: 1919
src: Ngoài Angimex, năm ngoái Công ty cổ phần Bảo vệ thực vật An Giang đã xuất khẩu hơn 300 tấn gạo tấm sang Nhật.
ref: In addition to Angimex, An Giang Plant Protection Joint Stock Company last year exported over 300 tons of broken rice to Japan.
gtt: Besides Angimex, last year An Giang Plant Protection Joint Stock Company exported more than 300 tons of broken rice to Japan.
own: in addition to angimex, last year an giang plant protection jsc exported more than 300 tons of broken rice to japan.
Both google translate and my model gave similar, valid translations, but my model shortened "Joint Stock Company" to "jsc".
TP = -0.413
D-TP = -0.428
D-Lex-Sim = 0.662

idx: 1935
src: Ông Tuyên nói, "Nếu được phê duyệt, kế hoạch này sẽ sớm được thực hiện".
ref: "If approved, this plan will soon be implemented," said Tuyen.
gtt: "If approved, this plan will soon be implemented," Mr. Tuyen said.
own: if approved, the plan will be implemented soon, "tuyen said.
Both google translate and my model gave similar, valid translations, but my model used punctuation incorrectly (mismatched opening and closing quotation marks).
TP = -0.463
D-TP = -0.489
D-Lex-Sim = 0.769

idx: 408
src: Nhà sản xuất xe hơi tại Nhật Bản đã buộc phải hạn chế sản lượng do hậu quả của trận động đất và sóng thần.
ref: Japanese car manufacturers were forced to curb output in the aftermath of the earthquake and tsunami.
gtt: The Japanese car maker was forced to limit output as a result of the earthquake and tsunami.
own: the car manufacturer in japan was forced to limit production as a result of the earthquake and tsunami.
Both google translate and my model gave similar, valid translations, but my model's "the car manufacturer in japan" corresponds more closely to the Vietnamese source.
TP = -0.410
D-TP = -0.407
D-Lex-Sim = 0.800

idx: 1589
src: Đoàn Việt Nam sẽ tham gia vào các hoạt động của Hội nghị Cấp cao ASEM 9 lần này với tinh thần chủ động, tích cực và có trách nhiệm, nhằm đóng góp thiết thực vào việc thúc đẩy hợp tác Á-Âu.
ref: The Vietnamese delegation will take part in the 9th ASEM Summit's activities with a high sense of activeness, positiveness and responsibility to practically contribute in fostering Asia - Europe cooperation.
gtt: The Vietnamese delegation will participate in activities of this 9th ASEM Summit with a proactive, positive and responsible spirit, in order to make practical contributions to promoting Asia-Europe cooperation.
own: the vietnamese delegation will participate in the activities of this nine-time asem summit in a proactive, positive and responsible spirit, aiming to make a practical contribution to the promotion of eurasian cooperation.
My model incorrectly translated "Á-Âu" as "eurasian" instead of "Asia-Europe". Also, "nine-time" is odd (it should be "9th" or "ninth"). The rest looks valid.
TP = -0.554
D-TP = -0.580
D-Lex-Sim = 0.648

idx: 1312
src: Ngày 5 tháng 12 năm 2007, các đại diện của ITA tại Hà Nội đã có chuyến viếng thăm trường tiểu học Sông Giang để trao tặng 17 triệu đồng cho 5 giáo viên của trường này.
ref: On December 5, 2007, ITA's representatives in Hanoi paid a visit to Song Giang primary school to hand over VND17 million to the 5 mentioned school teachers.
gtt: On December 5, 2007, representatives of ITA in Hanoi visited Song Giang primary school to donate VND 17 million to 5 teachers of this school.
own: on 5 december 2007, representatives of ita in hanoi visited the river giang primary school to award vnd17 million to five teachers of the school.
My model produced an odd date format, "5 december 2007" (should be "december 5, 2007", "5/12/2007", etc.). It also translated "trường tiểu học Sông Giang" as "river giang primary school", but "Sông Giang" is the school's name and should have been left unchanged. The rest looks valid.
TP = -0.528
D-TP = -0.521
D-Lex-Sim = 0.549

idx: 1616
src: Ngày 25-11, đại biểu Quốc hội Thành phố Hồ Chí Minh do Chủ tịch nước Trương Tấn Sang dẫn đầu đã có cuộc tiếp xúc cử tri tại quận 3 và quận 4.
ref: Voters in district 3 and district 4 met with Ho Chi Minh City National Assembly deputies led by state President Truong Tan Sang on November 25.
gtt: On November 25, a member of the National Assembly of Ho Chi Minh City led by President Truong Tan Sang had a meeting with voters in District 3 and District 4.
own: on november 25, ho chi minh city's national assembly member, led by state president truong tan, met with voters in district 3 and district 4.
My model did not complete the president's name "truong tan sang". Both my model and google translate used the singular "member", although in this context it should be the plural "members". The rest looks valid.
TP = -0.449
D-TP = -0.465
D-Lex-Sim = 0.755

idx: 1055
src: Thủ đô Kula Lumpur hấp dẫn người nước ngoài vì vẻ đẹp hiện đại của những khu nhà chọc trời và trung tâm mua sắm lớn, hoà lẫn với nét dân dã của những ngôi làng cổ và khu phố ẩm thực Hoa, Ấn.
ref: Kula Lumpur capital attracts foreigners because of modern beauty of skyscrapers and large shopping centers, mixing with rustic accent villages and Chinese and Indian food streets.
gtt: The capital, Kula Lumpur, attracts foreigners because of the modern beauty of skyscrapers and large shopping centers, mixed with the rustic features of old villages and Chinese and Indian food streets.
own: the capital city of kula lumpur is attractive to foreigners for the modern beauty of large skyscrapers and shopping malls, mingling with the savagery of ancient villages and chinatown.
My model made a mistake with "the capital city of kula lumpur" (the "of" should be removed). "savagery of ancient villages" also sounds wrong, as "savagery" has a negative connotation. It also translated "khu phố ẩm thực Hoa, Ấn" incorrectly as "chinatown", dropping "ẩm thực" ("food") and "Ấn" ("Indian").
TP = -0.694
D-TP = -0.716
D-Lex-Sim = 0.614

idx: 825
src: Ứng viên trúng tuyển sẽ lên đường đi học vào tháng 7-8 năm 2012 và phải hoàn thành khóa học hoặc phải hoàn trả toàn bộ kinh phí cho Chính phủ Brunei.
ref: Successful candidates will start to study abroad in July-August 2012 and have to finish the course or pay back complete school fee for Brunei Government.
gtt: Successful candidates will be leaving for school in July-August 2012 and must complete the course or fully reimburse the Government of Brunei.
own: the matriculating applicant is due to leave for school in july – august 2012 and must complete the course or repay all funds to the brunei government.
Both google translate and my model gave similar, valid translations.
TP = -0.485
D-TP = -0.492
D-Lex-Sim = 0.758

idx: 379
src: Được thành lập từ năm 1905, NUS là viện đại học đa ngành lớn nhất trong ba viện đại học công lập của Singapore.
ref: The NUS was established in 1905 and is one of the three largest multi-disciplinary public institutions in Singapore.
gtt: Established in 1905, NUS is the largest multidisciplinary university of the three public universities in Singapore.
own: founded in 1905, nus is the largest of the three public universities in singapore.
Both google translate and my model gave similar, valid translations, but my model dropped "đa ngành" ("multidisciplinary").
TP = -0.317
D-TP = -0.321
D-Lex-Sim = 0.991

idx: 373
src: Trại hè quốc tế NUS lần thứ 7 tự tin sẽ thu hút được đông đảo sự quan tâm của giới học sinh sinh viên Việt Nam và các nước.
ref: The 7th NUS international summer camp will confidently attract a lot of attention in student circles of Vietnam and others.
gtt: The 7th NUS International Summer Camp is confident to attract a lot of attention from Vietnamese and international students.
own: the confident 7th nus international summer camp will attract a large amount of attention from vietnamese students and other countries.
My model put the word "confident" in the wrong position. The final phrase "vietnamese students and other countries" is also odd (it should be "students from Vietnam and other countries", etc.).
TP = -0.867
D-TP = -0.867
D-Lex-Sim = 0.638

idx: 1998
src: Các chuỗi siêu thị tại TPHCM đã đưa ra các chương trình khuyến mãi kéo dài cho đến Tết Âm Lịch để kích cầu.
ref: Supermarket chains in HCMC have launched sales promotion programs lasting until the Lunar New Year in order to stimulate demand.
gtt: Supermarket chains in Ho Chi Minh City have launched promotions until the Lunar New Year to stimulate demand.
own: supermarket chains at tphcm have launched promotions that last until the lunar new year to jack up demand.
My model left "tphcm" untranslated, even though it is the abbreviation of "thành phố Hồ Chí Minh" (Ho Chi Minh City). It also used the odd phrase "jack up demand".
TP = -0.485
D-TP = -0.492
D-Lex-Sim = 0.758